ANALYZING TWITTER DATA TO IDENTIFY WORKPLACE-RELATED ISSUES IN REAL TIME
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Table of contents\n",
"**Note:** Please open the instructions in a new tab as it can't load the document by left clicking. \n",
"Follow the [instructions](https://ibm.box.com/s/1t4ism8ru8bdsu3q272sqlgknxys6mju) to run the notebook. \n",
"This notebook is divided into the following parts:\n",
"\n",
"[Part 1: Setup](#setup) \n",
"[Part 2: Accessing Twitter API and Scraping the Data](#access) \n",
"[Part 3: Cleaning the Twitter Data](#clean) \n",
"[Part 4: Watson Discovery | NLU](#watson) \n",
"[Part 5: Analyzing Enriched Data](#analyze) \n",
"[Part 6: Create a Watson Knowledge Studio Model](#wks) \n",
"[Part 7: Custom Model](#custom) \n",
"[Part 8: Visualizing the Results on World Map](#visualize) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"# 1. Setup"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**NOTE:** We need a project token as it lets us to import/export assets from our project. for example the csv file we upload as a part of the assets we uploaded earlier."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### INSERT PROJECT TOKEN\n",
"1. Go to the 3-dots and select insert project token\n",
"2. Follow the error message link to the project settings page and create a new key\n",
"3. Come back to the notebook and hit the \"insert project token\" option again\n",
"4. Scroll to the top and run the new cell with your project token"
]
},
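{
"cell_type": "markdown",
"metadata": {},
"source": [
"**NOTE:** For reference, the auto-inserted project token cell usually looks something like the sketch below, with your real project ID and access token filled in. This is only an illustration; use the cell that \"Insert project token\" generates for you rather than copying these placeholder values."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch only -- the \"Insert project token\" option generates this cell\n",
"# automatically with your real IDs; the values below are placeholders.\n",
"from project_lib import Project\n",
"project = Project(project_id='YOUR_PROJECT_ID', project_access_token='YOUR_PROJECT_ACCESS_TOKEN')\n",
"pc = project.project_context"
]
},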
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"## 1.1. Importing libraries"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**NOTE:** We need to import these libraries in order to run the notebook.\n",
"\n",
"1. Numpy - NumPy is a package in Python used for Scientific Computing. NumPy package is used to perform different operations. The ndarray (NumPy Array) is a multidimensional array used to store values of same datatype. We use this for doing operations on the dataframe we create (twitter_data).\n",
"\n",
"2. Pandas - pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. We use this for convert our csv asset into a dataframe (twitter_data).\n",
"\n",
"3. json - The JSON module is mainly used to convert the python dictionary above into a JSON string that can be written into a file. While the JSON module will convert strings to Python datatypes, normally the JSON functions are used to read and write directly from JSON files.\n",
"\n",
"4. re - This module provides regular expression matching operations. A regular expression (or RE) specifies a set of strings that matches it; the functions in this module let you check if a particular string matches a given regular expression (or if a given regular expression matches a particular string, which comes down to the same thing).\n",
"\n",
"5. OS - The OS module in Python provides a way of using operating system dependent functionality. The functions that the OS module provides allows you to interface with the underlying operating system that Python is running on.\n",
"\n",
"6. datetime and time - In Python, date, time and datetime classes provides a number of function to deal with dates, times and time intervals. Date and datetime are an object in Python, so when you manipulate them, you are actually manipulating objects and not string or timestamps.\n",
"\n",
"7. Geotext - Geotext extracts country and city mentions from text.\n",
"\n",
"8. Twitter - used for importing twitter API and auth from it.\n",
"\n",
"9. IBM-watson - used for connecting to different watson services in teh notebook. for example discovery."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"!pip install --upgrade pandas\n",
"import pandas as pd\n",
"\n",
"import json\n",
"import re\n",
"\n",
"import os\n",
"import datetime\n",
"import time\n",
"\n",
"!pip install geotext\n",
"from geotext import GeoText\n",
"import geotext\n",
"\n",
"!pip install twitter\n",
"import twitter\n",
"\n",
"!pip install --upgrade ibm-watson"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 2/3. Alternative - Import Workshop Twitter Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### For this workshop we prepared data for you, so you don't need to signup for a twitter developer account. You can run this cell and skip steps 2 and 3."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**NOTE:** The cell below used the porject token to import the csv data using var: my_file. Then it gets converted into a dataframe var: twitter_data and we show the first 20 entries of the data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Fetch the file\n",
"my_file = project.get_file(\"workplace_tweets_prepared_data.csv\")\n",
"\n",
"# Read the CSV data file from the object storage into a pandas DataFrame\n",
"my_file.seek(0)\n",
"twitter_data = pd.read_csv(my_file, nrows=1000)\n",
"twitter_data = twitter_data.replace(np.nan, '')\n",
"\n",
"twitter_data.head(20)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"# 2. Accessing Twitter API and Scraping the Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Follow the instructions document to make a Twitter Developer Account, Twitter App, and Generate Keys."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Go to https://developer.twitter.com/en/apps to create an app and get values for these credentials.\n",
"# You'll need to provide them in place of these empty string values that are defined as placeholders to access Twitter API.\n",
"access_token = \"paste your token here\"\n",
"access_token_secret = \"paste your token here\"\n",
"consumer_key = \"paste your token here\"\n",
"consumer_secret = \"paste your token here \""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# See https://developer.twitter.com/en/docs for more information on Twitter's OAuth implementation.\n",
"auth = twitter.oauth.OAuth(access_token, access_token_secret,consumer_key,consumer_secret)\n",
"twitter_api = twitter.Twitter(auth=auth)\n",
"\n",
"# Set this variable to a trending topic, or anything else for that matter. \n",
"# The example query below was a trending topic when this content was being developed and is used throughout the remainder of this notebook.\n",
"queries = ['worker','workplace','workersrights','employer', 'employee','employment','employmentlaw',\n",
" '#worker','#workplace','#workersrights','#employer', '#employee','#employment','#employmentlaw']\n",
"count = 100\n",
"\n",
"statuses = []\n",
"query_text = []\n",
"\n",
"for q in queries:\n",
" search_results = twitter_api.search.tweets(q=q,lang='en',count=count,tweet_mode=\"extended\")['statuses']\n",
" \n",
" statuses.extend(search_results)\n",
" query_text.extend([q]*len(search_results))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"# 3. Cleaning the Twitter Data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Select the valuable fields\n",
"status_texts = [ {'tid': status['id'],'text':status['full_text'],'time': status['created_at'],\n",
" 'lang':status['lang'],'location':status['user']['location'],'place':status['place'],\n",
" 'source':status['source'],'retweeted':status['retweet_count'], \n",
" 'user': status['user']['screen_name']} for status in statuses ]\n",
"\n",
"# Create a data frame\n",
"twitter_data = pd.DataFrame(data=status_texts)\n",
"\n",
"# Extract Country if place is not empty\n",
"twitter_data['place'] = twitter_data['place'].apply(lambda l: None if l == None else l['country'])\n",
"\n",
"# Extract time \n",
"twitter_data[\"time\"] = twitter_data['time'].apply(lambda dt: dt.split(\" \")[1] + \", \" + dt.split(\" \")[2] + \", \" + dt.split(\" \")[5] + \", \" + dt.split(\" \")[3])\n",
"twitter_data[\"time\"] = twitter_data[\"time\"].apply(lambda s: datetime.datetime.strptime(s, '%b, %d, %Y, %H:%M:%S'))\n",
"\n",
"# Extract source link\n",
"twitter_data['source'] = twitter_data[\"source\"].apply(lambda s: s.split(\" \")[1][5:])\n",
"\n",
"# Remove the user name of retweet\n",
"twitter_data[\"text\"] = twitter_data[\"text\"].apply(lambda s: re.sub(r\"^RT @.{0,20}:\",\"\",s) if re.match(r\"^RT @.{0,20}:\",s) else s)\n",
"twitter_data[\"text\"] = twitter_data[\"text\"].apply(lambda s: s.strip())\n",
"\n",
"# Add the query text to the dataframe\n",
"twitter_data['q_text'] = query_text\n",
"\n",
"# Remove duplicate tweets i.e. retweets\n",
"twitter_data = twitter_data[~twitter_data['text'].duplicated()]\n",
"\n",
"# Reset index and rename columns\n",
"twitter_data.reset_index(drop=True,inplace=True)\n",
"twitter_data.rename(columns={'time':'datetime'}, inplace=True)"
]
},
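{
"cell_type": "markdown",
"metadata": {},
"source": [
"**NOTE:** As a quick illustration of the retweet cleanup above, the snippet below (using a made-up tweet) shows how the RT prefix is stripped before the text is trimmed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative example of the retweet-prefix cleanup used above (made-up tweet text)\n",
"sample = \"RT @some_user: Overtime pay withheld again this month #workersrights\"\n",
"cleaned = re.sub(r\"^RT @.{0,20}:\", \"\", sample) if re.match(r\"^RT @.{0,20}:\", sample) else sample\n",
"print(cleaned.strip())   # -> Overtime pay withheld again this month #workersrights"
]
},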
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"twitter_data.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"twitter_data.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"# 4. Watson Discovery | Natural Language Understanding (NLU)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Follow the instructions document to create your Watson Discovery Service before continuing."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**NOTE:** We access the discovery services using the batch variable."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"batch = \"batch_one\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4.1. Creating JSON for each tweet"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**NOTE:** This converts each tweet into a json obejct as disovery casn only operate on json type objects. we create a new directory in discovery named \"tweets\" and push each tweet in it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## to make int64 serializable for JSON file for datetime column\n",
"def default(o): \n",
" if isinstance(o, np.int64): return int(o) \n",
" else: return str(o)\n",
"\n",
"## creating tweets directory \n",
"if not os.path.isdir('tweets_'+ batch):\n",
" os.mkdir('tweets_'+batch)\n",
"\n",
"## creating json file per tweet\n",
"for i in twitter_data.index:\n",
" with open((\"./tweets_{}/tweet_{}.json\".format(batch,i)),\"w\") as outfile:\n",
" json.dump({\"text\": twitter_data.loc[i,\"text\"],\"datetime\": twitter_data.loc[i,\"datetime\"],\n",
" \"tid\": twitter_data.loc[i,\"tid\"],\"user\": twitter_data.loc[i,\"user\"], \"q_text\": twitter_data.loc[i,\"q_text\"],\n",
" \"source\": twitter_data.loc[i,\"source\"],\n",
" \"location\": twitter_data.loc[i,\"location\"],\"place\": twitter_data.loc[i,\"place\"]}, outfile, default=default) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4.2. Initiating Watson Discovery\n",
"\n",
"Before running the cells below complete the corresponding steps in the Instructions (you'll need a watson discovery collection)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**NOTE:** this cell establishes the connection to discovery using ibm-watson then we import the id's from discovery service and get connected."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Setup Discovery API so we can interact with it using this Notebook\n",
"from ibm_watson import DiscoveryV1\n",
"\n",
"discovery = DiscoveryV1(version='2019-03-25',\n",
" iam_apikey=\"YOUR_API_KEY\", # Enter your credentials\n",
" url=\"https://gateway.watsonplatform.net/discovery/api\")\n",
"\n",
"env_id = 'YOUR_ENVIRONMENT_ID' # Enter your credentials\n",
"col_id = 'YOUR_COLLECTION_ID' # Enter your credentials\n",
"\n",
"discovery.set_default_headers({'x-watson-learning-opt-out': \"true\"})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**NOTE:** here we are uploading the tweets to discovery service."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Upload each Tweet to Watson Discovery so it can \"enrich\" them (meaning it will identify categories, entities, and the sentiment of the tweet). \n",
"# For further analysis we could also include the Watson Tone Analyzer to identify emotions, Watson Personality Insights to identify the users Traits, etc.\n",
"for ind, file_object in enumerate(os.listdir(\"./tweets_{}/\".format(batch))):\n",
" if file_object.endswith(\".json\"):\n",
" document_json = (open(os.path.join(\"./tweets_{}/\".format(batch),file_object)).read())\n",
" doc_id = \"tweet\" + str(ind)\n",
" add_doc = discovery.update_document(environment_id=env_id, collection_id=col_id, document_id = doc_id,file = document_json, \n",
" file_content_type = \"application/json\", filename = file_object).get_result()\n",
" else:\n",
" pass"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"discovery.get_collection(env_id, col_id).get_result()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"# 5. Analyzing Enriched Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**NOTE** The cell below creates an array of fields we want to use for enrichment after they got put in discovery. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Select fields to extract from enriched data\n",
"flds = ['tid','user','text','place','location','datetime',\n",
" 'enriched_text.sentiment.document.label','enriched_text.sentiment.document.score',\n",
" 'enriched_text.categories.label','enriched_text.categories.score',\n",
" 'enriched_text.entities.type','enriched_text.entities.relevance','enriched_text.entities.text',\n",
" 'enriched_text.entities.sentiment.label','enriched_text.entities.sentiment.score']\n",
"flds = ','.join(flds)\n",
"flds"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**NOTE:** The cell below allows discovery to run queries on the fields. Then it saves that to a dataframe named enriched_data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Obtain enriched data from Watson Discovery\n",
"query_result = discovery.query(environment_id=env_id, collection_id=col_id, return_fields=flds, count=1000).get_result()['results']\n",
"enriched_data = pd.DataFrame(query_result)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**NOTE:** The cell below applies the enrichments discovery uses to refine the data, these are categories, entities and sentiment."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Clean the dataframe\n",
"enriched_data['categories'] = enriched_data['enriched_text'].apply(lambda d: d.pop('categories') if 'categories' in d else np.nan)\n",
"enriched_data['entities'] = enriched_data['enriched_text'].apply(lambda d: d.pop('entities') if 'entities' in d else np.nan)\n",
"enriched_data['sentiment'] = enriched_data['enriched_text'].apply(lambda x: x.pop('sentiment')['document'] if 'sentiment' in x else np.nan)\n",
"\n",
"enriched_data.drop(columns={'enriched_text','id','result_metadata'},inplace=True)\n",
"\n",
"enriched_data['sentiment_score'] = enriched_data['sentiment'].apply(lambda x: x.pop('score') if not isinstance(x, float) else 0)\n",
"enriched_data['sentiment'] = enriched_data['sentiment'].apply(lambda x: x.pop('label') if not isinstance(x, float) else 'neutral')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5.1. Applying First Filter to Extract Non-positive Tweets"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# First filter: non-positive tweets. We are interested in the neutral/negative ones for our study.\n",
"e_data_filtered = enriched_data.copy()[enriched_data['sentiment'] != 'positive'].reset_index(drop=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"There are {} non-positive tweets.\".format(len(e_data_filtered)))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"e_data_filtered.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5.2. Applying Second Filter to Extract Relevant Tweets by Category and Content"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Second filter: by relevant categories and words. We want to remove any unrelated tweets.\n",
"def categories_filter(ct):\n",
" relevant_cats_one = set(['/society/work'])\n",
" relevant_cats_two = set(['/society/work/unemployment','/society/crime/sexual offence','/society/crime/personal offense',\n",
" '/society/welfare/healthcare','/society/unrest and war',\n",
" '/business and industrial/construction','/religion and spirituality/islam'])\n",
" sports = re.compile(\"(auto)\")\n",
" irrelevant_cats = re.compile(\"(careers)|(education)|(robotics)|(investing)|(shopping)|(family)|(travel)|(science)|(health and fitness)|(art )\")\n",
" irrelevant_cats_one = re.compile(\"(careers)|(sports)|(education)|(robotics)|(investing)|(shopping)|(family)|(travel)|(science)|(health and fitness)|(art )\")\n",
" irrelevant_cats_two = re.compile(\"(careers)|(sports)|(education)|(art )|(news)|(plans)|(robotics)|(investing)|(shopping)|(family)|(travel)|(business operations)|(health and fitness)\")\n",
" \n",
" if (any(i in ct for i in relevant_cats_one))&(sports.search(\" \".join(ct)) != None)&(irrelevant_cats.search(\" \".join(ct)) == None):\n",
" return True\n",
" elif (any(i in ct for i in relevant_cats_one))&(irrelevant_cats_one.search(\" \".join(ct)) == None):\n",
" return True\n",
" elif (any(i in ct for i in relevant_cats_two))&(irrelevant_cats_two.search(\" \".join(ct)) == None):\n",
" return True\n",
" else:\n",
" return False\n",
" \n",
"def stop_words_filter(t):\n",
" irrelevant_words = ['jobsearch', 'interview', 'apply now','stock','employment law', 'employmentlaw', 'legislation',\n",
" 'research', 'study','survey','reforms','nigga','tweet','whoopee','announcement','read here','news','government',\n",
" 'trump','brexit','hard worker','culture','leadership','president','regulat','federal','for more on ','blog post',\n",
" 'need a job','tax','hr']\n",
" irre_words_one= re.compile(\"|\".join(irrelevant_words),re.IGNORECASE)\n",
" irre_words_two =re.compile(\"(how to)|(how do)|(tips for)|(tips to)|(under (?s)(.*) act)|(text [[0-9]+)|(job(?s)(.*) wanted)\", re.IGNORECASE)\n",
" \n",
" re_words=re.compile(\"discrimmination|protest|fatalindustrialinjury|injur|factory\",re.IGNORECASE)\n",
" \n",
" if re_words.search(t):\n",
" return True\n",
" elif irre_words_one.search(t) or irre_words_two.search(t):\n",
" return False\n",
" else:\n",
" return True"
]
},
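{
"cell_type": "markdown",
"metadata": {},
"source": [
"**NOTE:** As a quick sanity check, the hypothetical example below shows how the two filters behave on a made-up category set and tweet; the full DataFrame is filtered a couple of cells further down."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical sanity check of the two filters on made-up inputs\n",
"sample_categories = {'/society/work', '/business and industrial/construction'}\n",
"sample_text = \"Another worker injured at the factory today, still no safety inspection\"\n",
"\n",
"print(categories_filter(sample_categories))   # True: relevant category present, no irrelevant ones\n",
"print(stop_words_filter(sample_text))         # True: matches the injury/factory keep-list"
]
},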
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Drop Tweets that didn't have enough data to be enriched (NaN). These are usually tweets that had very few words or nouns for Discovery to interpret.\n",
"for index, item in enumerate(e_data_filtered['categories']):\n",
" if(not isinstance(item, list)):\n",
" print(\"DROP DATA:\")\n",
" print(index, item)\n",
" for type in e_data_filtered:\n",
" print(str(type) + ': ' + str(e_data_filtered[type][index]))\n",
" e_data_filtered = e_data_filtered.drop(index)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"e_data_filtered['cat_labels'] = e_data_filtered['categories'].apply(lambda l: set([i['label'] for i in l]))\n",
"e_data_filtered['cat_relevant'] = e_data_filtered['cat_labels'].apply(categories_filter)\n",
"e_data_filtered['word_relevant'] = e_data_filtered['text'].apply(stop_words_filter)\n",
"e_data_filtered = e_data_filtered[e_data_filtered['cat_relevant'] & e_data_filtered['word_relevant']].reset_index(drop=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"There are about {} relevant tweets among non-positive tweets.\".format(len(e_data_filtered)))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"e_data_filtered.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5.3. Applying Third Filter to Extract Tweets with Category Confidence Score Above 70%"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Third filter: category confidence score filtering\n",
"score_cats = ['/society/work','/business and industrial/business operations/human resources/compensation and benefits','law, govt and politics']\n",
"e_data_filtered['work_score'] = e_data_filtered['categories'].apply(lambda l: np.sum([i['score'] if i['label'] == score_cats[0] else 0 for i in l]))\n",
"e_data_filtered['bene_score'] = e_data_filtered['categories'].apply(lambda l: np.sum([i['score'] if i['label'] == score_cats[1] else 0 for i in l]))\n",
"e_data_filtered['law_score'] = e_data_filtered['categories'].apply(lambda l: np.sum([i['score'] if i['label'] == score_cats[2] else 0 for i in l]))\n",
"\n",
"e_data_filtered = e_data_filtered[(e_data_filtered['work_score'] >= 0.7)|(e_data_filtered['work_score'] >= 0.7)|\n",
" (e_data_filtered['law_score'] >= 0.7)|(e_data_filtered['work_score'] == 0)]\n",
"# Dropping unnecessary columns and resetting index\n",
"e_data_filtered.drop(columns=['categories','cat_relevant','word_relevant'],inplace=True)\n",
"e_data_filtered.reset_index(drop=True,inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"There are {} relevant tweets among non-positive tweets with 70% relevant category confidence.\".format(len(e_data_filtered)))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"e_data_filtered.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5.4. Removing # and @ Signs"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"e_data_filtered['text'] = e_data_filtered['text'].apply(lambda x: re.sub(r'[@#]','',x))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"e_data_filtered.head(3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"# 6. Create a Watson Knowledge Studio Model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Go to the instructions pdf document and follow the steps to train a Knowledge Studio Model."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
" '''\n",
" This cell doesn't do anything, but you can paste you model id and iam_apikey from step 4.2 here for safe keeping...\n",
" [ model_id: ]\n",
" [ iam_apikey: ]\n",
" \n",
"'''"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"# 7. Custom Model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before running the cells, go back to your discovery service and create a new collection for our custom model. We will upload the non-positive tweets we identified here for more vigorous analysis."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from ibm_watson import DiscoveryV1\n",
"\n",
"discovery_custom = DiscoveryV1(version='2018-10-15',\n",
" iam_apikey=\"YOUR_API_KEY\", # Enter your credentials (same as above, step 4.2) \n",
" url=\"https://gateway.watsonplatform.net/discovery/api\")\n",
"\n",
"env_id_custom = 'YOUR_ENVIRONMENT_ID' # Enter your credentials (same as above, step 4.2)\n",
"col_id_custom = 'YOUR_COLLECTION_ID' # Enter your credentials (NEW)\n",
"\n",
"conf_id_custom = 'YOUR_CONFIGURATION_ID' # Enter your credentials (NEW)\n",
"\n",
"discovery_custom.set_default_headers({'x-watson-learning-opt-out': \"true\"})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7.1. Creating JSON for each tweet"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Save the tweets in JSON format\n",
"batch = \"batch_one\"\n",
"\n",
"## to make int64 serializable for JSON file for datetime column\n",
"def default(o): \n",
" if isinstance(o, np.int64): return int(o) \n",
" else: return str(o)\n",
"\n",
"## creating tweets directory \n",
"if not os.path.isdir('filtered_tweets_'+ batch):\n",
" os.mkdir('filtered_tweets_'+batch)\n",
"\n",
"## creating json file per tweet\n",
"for i in e_data_filtered.index:\n",
" with open((\"./filtered_tweets_{}/tweet_{}.json\".format(batch,i)),\"w\") as outfile:\n",
" json.dump({\"text\": e_data_filtered.loc[i,\"text\"],\"tid\": e_data_filtered.loc[i,\"tid\"]}, outfile, default=default) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7.2 Sending the Data to the Default Model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Discovery won't let us apply our custom Watson Knowledge Studio model unless there is data in the collection, so we'll upload some tweets."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Upload data to discovery (JSON formatted)\n",
"batch = \"batch_one\"\n",
"for ind, file_object in enumerate(os.listdir(\"./filtered_tweets_{}/\".format(batch))):\n",
" if file_object.endswith(\".json\"):\n",
" document_json = (open(os.path.join(\"./filtered_tweets_{}/\".format(batch),file_object)).read())\n",
" doc_id = \"tweet\" + str(ind)\n",
" add_doc = discovery_custom.update_document(environment_id=env_id_custom, collection_id=col_id_custom,\n",
" document_id = doc_id,file = document_json,\n",
" file_content_type = \"application/json\", filename = file_object).get_result()\n",
" else:\n",
" pass"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"discovery_custom.get_collection(env_id_custom, col_id_custom).get_result()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7.3. Adding a Custom Model to Discovery"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create custom configuration (a 'configuration' tells discovery which data it should enrich and how it should enrich it)\n",
"custom_conf = discovery_custom.get_configuration(environment_id=env_id_custom,configuration_id=conf_id_custom).get_result()\n",
"custom_conf['enrichments'][0]['destination_field'] = 'wks_enriched_text'\n",
"\n",
"default_conf = {'destination_field': 'enriched_text','enrichment': 'natural_language_understanding',\n",
" 'options': {'features': {'categories': {},'concepts': {'limit': 8},\n",
" 'entities': {'emotion': False, 'limit': 50, 'sentiment': True},\n",
" 'sentiment': {'document': True}}},'source_field': 'text'}\n",
"\n",
"custom_conf['enrichments'].append(default_conf)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Add our configuration to Watson Discovery\n",
"discovery_custom.update_configuration(environment_id=env_id_custom,configuration_id=conf_id_custom,name='twitter_conf3',enrichments=custom_conf['enrichments'])\n",
"\n",
"# Tell the discovery collection to use our custom configuration on new documents\n",
"updated_collection = discovery_custom.update_collection(env_id_custom, collection_id=col_id_custom, configuration_id=conf_id_custom, name='Workshop Custom').get_result()\n",
"print(json.dumps(updated_collection, indent=2))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Now that the custom model has been added, follow the instructions to add our trained Watson Knowledge Studio model to the custom configuration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Since the documents we just uploaded were enriched using the default model, we will delete them and upload again.\n",
"query = discovery_custom.query(environment_id=env_id_custom, collection_id=col_id_custom ,query='*.*', count=50)\n",
"\n",
"for doc in query.result['results']:\n",
" delete_doc = discovery_custom.delete_document(env_id_custom, col_id_custom, doc['id']).get_result()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7.4 Sending the Data to the Custom Model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Add tweets to enrich with our custom model\n",
"batch = \"batch_one\"\n",
"for ind, file_object in enumerate(os.listdir(\"./filtered_tweets_{}/\".format(batch))):\n",
" if file_object.endswith(\".json\"):\n",
" document_json = (open(os.path.join(\"./filtered_tweets_{}/\".format(batch),file_object)).read())\n",
" doc_id = \"tweet\" + str(ind)\n",
" add_doc = discovery_custom.update_document(environment_id=env_id_custom, collection_id=col_id_custom,\n",
" document_id = doc_id,file = document_json,\n",
" file_content_type = \"application/json\", filename = file_object).get_result()\n",
" else:\n",
" pass"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"discovery_custom.get_collection(env_id_custom, col_id_custom).get_result()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7.5. Analyzing Enriched Data from Custom Model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Select fields to extract from enriched data\n",
"c_flds = ['tid','text',\n",
" 'enriched_text.sentiment.document.label','enriched_text.sentiment.document.score',\n",
" 'enriched_text.categories.label','enriched_text.categories.score',\n",
" 'enriched_text.entities.type','enriched_text.entities.text',\n",
" 'wks_enriched_text.entities.type','wks_enriched_text.entities.text']\n",
"c_flds = ','.join(c_flds)\n",
"c_flds"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"custom_data = pd.DataFrame(discovery_custom.query(environment_id=env_id_custom,\n",
" collection_id=col_id_custom,return_fields=c_flds,count = 100).get_result()['results'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Extract and process default categories\n",
"custom_data['wks_categories'] = custom_data['enriched_text'].apply(lambda d: d.pop('categories') if 'categories' in d else np.nan)\n",
"custom_data['wks_categories'] = custom_data['wks_categories'].apply(lambda l: set([i['label'].split('/')[-1] for i in l]))\n",
"\n",
"# Extract and process entities and custom categories\n",
"custom_data['def_entities'] = custom_data['enriched_text'].apply(lambda d: d.pop('entities') if 'entities' in d else np.nan)\n",
"custom_data['wks_entities'] = custom_data.loc[~custom_data['wks_enriched_text'].isna(),'wks_enriched_text'].apply(lambda d: d.pop('entities') if 'entities' in d else np.nan)\n",
"\n",
"# Extract and process sentiment\n",
"custom_data['wks_sentiment'] = custom_data['enriched_text'].apply(lambda d: d.pop('sentiment')['document'])\n",
"custom_data['sent_score'] = custom_data['wks_sentiment'].apply(lambda x: x.pop('score'))\n",
"custom_data['sent'] = custom_data['wks_sentiment'].apply(lambda x: x.pop('label'))\n",
"\n",
"# Identify main issue of each tweet\n",
"custom_data.loc[custom_data['wks_entities'].isnull(),'wks_entities'] = custom_data.loc[custom_data['wks_entities'].isnull(),'wks_entities'].apply(lambda x: [])\n",
"custom_data['issue'] = custom_data['wks_entities'].apply(lambda l: set([d['type'] for d in l]))\n",
"\n",
"# Identify Location, Company, Organization in each tweet\n",
"ent_types = ['Company','Organization','Person','Location']\n",
"custom_data.loc[custom_data['def_entities'].isnull(),'def_entities'] = custom_data.loc[custom_data['def_entities'].isnull(),'def_entities'].apply(lambda x: [])\n",
"\n",
"for t in ent_types:\n",
" custom_data['wks_'+t] = custom_data['def_entities'].apply(lambda l: set([i['text'] for i in l if (i['type'] == t)]))\n",
"\n",
"# Drop unneccessary columns \n",
"custom_data.drop(columns=['enriched_text','id','result_metadata','wks_enriched_text','wks_sentiment','def_entities','wks_entities'],inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"custom_data.head(50)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7.6. Merging the datasets and Deriving the Correct Location Country"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"merged_data = pd.merge(e_data_filtered[['tid','user','location','place','work_score','datetime']],custom_data,on=['tid'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Convert column object types to corresponding data types\n",
"str_cols = ['wks_categories', 'issue', 'wks_Company','wks_Organization', 'wks_Person']\n",
"for c in str_cols:\n",
" merged_data[c] = merged_data[c].apply(lambda s: \", \".join(s))\n",
" \n",
"merged_data['datetime']= merged_data['datetime'].apply(lambda d: pd.to_datetime(d.split(\" \")[0], format='%Y-%m-%d'))\n",
"merged_data.loc[merged_data['wks_Company'].isna(),'wks_Company'] = merged_data.loc[merged_data['wks_Company'].isna(),'wks_Organization'].fillna(value=\"\")"
]
},
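{
"cell_type": "markdown",
"metadata": {},
"source": [
"**NOTE:** The next cell uses GeoText's country_mentions attribute to turn free-text location strings into ISO 3166 country codes. The illustrative snippet below (with made-up text) shows roughly what that mapping looks like."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative example of GeoText: country_mentions maps detected place names\n",
"# to ISO 3166 country codes with a count of how often each was mentioned.\n",
"example = GeoText(\"Protest outside the London office, workers flown in from Paris\")\n",
"print(example.country_mentions)                    # e.g. OrderedDict([('GB', 1), ('FR', 1)])\n",
"print(\"\".join(example.country_mentions.keys()))    # e.g. 'GBFR' -- how the next cell concatenates codes"
]
},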
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Identify country for each of the locations\n",
"locations = ['wks_Location','place','location']\n",
"for l in locations:\n",
" merged_data[l] = merged_data[l].apply(lambda s: \"\".join(GeoText(str(s)).country_mentions.keys()))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"merged_data.sort_values(by=['wks_Location','wks_Company','issue','place'],ascending=False,inplace=True)\n",
"merged_data.drop(columns=['wks_Organization','location'],inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"merged_data.head(3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Save the analyzed data to Cloud Object Storage (COS)\n",
"project.save_data('new_merged_data.csv',merged_data.to_csv(index=False),overwrite=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"# 8. Visualizing the Results on World Map"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Now that your data is saved, lets go and build a dashboard!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# END"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 1
}